Main analysis (Exploratory Data Analysis)
Our data exploration starts by first looking at variables that might help explain the delay:
- Date
- Day of the Week
- Distance
- Destination Airport
- Carrier
Naturally, we then proceed by looking at the following variables in an effort to determine if delays can be amended for:
- Arrival Delay
- Airline
- Destination Airport
- Destination State
Day of the Month
We initiate our Exploratory Data Analysis by examining the Delays in Departure time for every day of the month. The following scatterplot helps us determine whether a relationship can be inferred.
monthly_data <- mutate(df, month = format(as.Date(df$date,format="%Y-%m-%d"), "%m"),
day = format(as.Date(df$date,format="%Y-%m-%d"), "%d"))
monthly_data <- select(monthly_data, date, dep_delay, month, day)
monthly_data$month[monthly_data$month == 10] <- "October"
monthly_data$month[monthly_data$month == 11] <- "November"
monthly_data$month[monthly_data$month == 12] <- "December"
monthly_data$month = factor(monthly_data$month, levels=c("October","November","December"))
monthly_delay <- filter(monthly_data, df$dep_delay > 0)
monthly_early <- filter(monthly_data, df$dep_delay <= 0)
ggplot() +
geom_point(data = monthly_delay, aes(y = dep_delay, x = day), alpha = .5, color = "#774184", stroke = 0) +
geom_point(data = monthly_early, aes(y = dep_delay, x = day), alpha = .8, color = "#fce640", stroke = 0) +
facet_grid(month~.) +
ylab("Delay in Departure (Minutes)") +
xlab("Days of the Month") +
ggtitle("Departure Delay correlated with Days of the Month",
subtitle = "Limited between 12 hours delay and 2 hours early departure") +
labs(caption = "Source: Final Project Dataset")
Clearly we can observe that during the Christmas holidays, departure delay shows an increase in duration, thus suggesting that passengers spend a lot more time waiting in airports to depart for their final destination. The same conclusion can be derived for the weekend following Thanksgiving (Thursday the 23 of November), which again constitutes an important holiday for the US. Departure delays during the Thanksgiving period are at a high on Sunday, when most people are returning home. Finally, we can observe that certain days of the week, like Tuesdays or Wednesdays, have reduced Departure Delay times. This is expected since the main reasons for travelling during the week are business related.
Day of the Week
How are flights distributed throughout a week? Which day is most popular for travelling?
num_flights <- df %>% group_by(weekday) %>% summarize(count = n())
num_flights$perc <- num_flights$count / sum(num_flights$count)
num_flights$weekday <- factor(num_flights$weekday, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(num_flights, aes(weekday, perc)) +
geom_col(fill = '#774184' ) +
ggtitle('Percentage of Flights on Each Day') +
xlab("Day of the Week") +
ylab("Percentage of Flights") +
labs(caption = "Source: Final Project Dataset")
Flights seem to be evenly distributed over all days except Saturdays. This makes sense, as travellers with flights on Saturdays are likely to be on vacation - and are consequently more likely to depart for their destination on Friday. How do delays relate to days of the week?
df$weekday <- factor(df$weekday, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(df, aes(x=weekday, y=dep_delay)) + geom_boxplot(varwidth=TRUE, fill = '#774184') +
ggtitle('Distribution of Delay on Each Day') +
xlab("") +
ylab("Delay in Minutes") +
labs(caption = "Source: Final Project Dataset")
The range of our dataset makes it hard to visualize the distribution effectively and identify any patterns. One way of addressing this is taking the log of our y-axis. Of course, logging negative values would result in erroneous results so we’ll focus on positive delays for now.
ggplot(subset(df, df$dep_delay >0), aes(x=weekday, y=log(dep_delay))) + geom_boxplot(varwidth=TRUE, fill = '#774184') +
ggtitle('Distribution of Positive Delay on Each Day') +
xlab("") +
ylab("log(Delay in Minutes)") +
labs(caption = "Source: Final Project Dataset")
Judging by the plot, delays are generally lower during Tuesdays, Wednesdays and Thursdays and pick up during the weekend. We assume this is due to the fact that most flights during those days are business related and also earlier on in the day, which as shown later on also plays a factor.
df$pos_delay <- ifelse(df$dep_delay > 0, 'Positive', 'Negative')
ggplot(df, aes(x=weekday, y=log(abs(dep_delay)))) + geom_boxplot(aes(fill=factor(pos_delay)), varwidth=TRUE) +
scale_y_continuous(name = 'log(Delay in Minutes)') + xlab('Day of the Week') + guides(fill=guide_legend(title="Delay")) +
scale_fill_manual(values=c('#4E8441', '#774184')) +
labs(caption = "Source: Final Project Dataset")
More flights depart early than late, but positive delays have greater range. Early departures also don’t seem to vary by day.
Distance
How does distance relate to Departure Delays?
ggplot(df_backup, aes(x = dep_delay, y = distance)) +
geom_point(alpha = .3, color = "#441152", stroke = 0) +
xlab("Delays in Departure (Minutes)") +
ylab("Distance (Miles)") +
ggtitle("Departure Delays in relation to the travelled Distance") +
labs(caption = "Source: Final Project Dataset")
Plotting the scatterplot of the Distance in correlation with the Departure Delay, helps us determine the following:
- The different values of the delays recorded for a specific distance, seem to form a line.
- Also, there is a specific interval of delay that should become our main focus.
To further investigate the latter claim, we produce the following two graphs.
ggplot(df_backup, aes(x = dep_delay, y = distance)) +
geom_point(alpha = .3, color = "#4E8441", stroke = 0) +
xlim(-720,720) +
xlab("Delays in Departure (Minutes)") +
ylab("Distance (Miles)") +
ggtitle("Departure Delays in relation to the travelled Distance",
subtitle = "Limited in 12 hours delays - early departures") +
labs(caption = "Source: Final Project Dataset")
ggplot(df_backup, aes(x = dep_delay, y = distance)) +
geom_point(alpha = .3, color = "#441152", stroke = 0) +
xlim(-180,180) +
xlab("Delays in Departure (Minutes)") +
ylab("Distance (Miles)") +
ggtitle("Departure Delays in relation to the travelled Distance",
subtitle = "Limited in 3 hours delays - early departures") +
labs(caption = "Source: Final Project Dataset")
Now, we can clearly bound the times of the departure delays that we are going to examine between a maximum of 12 hours delay and a maximum of 2 hours early departure. This follows from the fact that a 3 hour delay or early departure interval (third graph) does not constitute a clear representation of the phenomenon that we are exploring. On the contrary, a 12 hours delay interval is more than enough to give us the full picture. When combined with the 2 hours early departure barrier we have a clear picture of the situation at hand.
Upon further investigation of the distance variable, we produced a plot of the unique values of distances traveled to reach each state.
states_recode <- c("1" = "AK", "2" = "HI", "3" = "PR", "4" = "VI", "5" = "TT", "11" = "CT", "12" = "ME", "13" = "MA", "14" = "NH", "15" = "RI", "16" = "VT", "21" = "NJ", "22" = "NY", "23" = "PA", "33" = "FL", "34" = "GA", "35" = "MD", "36" = "NC", "37" = "SC", "38" = "VA", "39" = "WV", "41" = "IL", "42" = "IN", "43" = "MI", "44" = "OH", "45" = "WI", "51" = "AL", "52" = "KY", "53" = "MS", "54" = "TN", "61" = "IA", "62" = "KS", "63" = "MN", "64" = "MO", "65" = "NE", "66" = "ND", "67" = "SD", "71" = "AR", "72" = "LA", "73" = "OK", "74" = "TX", "81" = "AZ", "82" = "CO", "83" = "ID", "84" = "MT", "85" = "NV", "86" = "NM", "87" = "UT", "88" = "WY", "91" = "CA", "92" = "OR", "93" = "WA")
distance_state_data <- df %>% select( dest_state, distance ) %>% group_by( dest_state, distance ) %>% summarize( count = n() )
distance_state_data <- distance_state_data %>% mutate( state = states_recode[as.character(dest_state)])
ggplot(distance_state_data, aes(x = state, y = distance)) +
geom_point(alpha = .8, color = "#441152", stroke = 0) +
xlab("Destination (State)") +
ylab("Distance (Miles)") +
theme(axis.text.x = element_text(size = 5)) +
ggtitle("Distance in relation to the State Destination",
subtitle = "Unique Distances traveled to reach corresponding State destination") +
labs(caption = "Source: Final Project Dataset")
Additionally, we create a graph of the distances that each Airline claims to have covered.
distance_carrier_data <- df %>% select( carrier, distance ) %>% group_by( carrier, distance ) %>% summarize( count = n() )
ggplot(distance_carrier_data, aes(x = carrier, y = distance)) +
geom_point(alpha = .8, color = "#4E8441", stroke = 0) +
coord_flip() +
xlab("US Airline") +
ylab("Distance (Miles)") +
ggtitle("Distance in relation to the Airline") +
labs(caption = "Source: Final Project Dataset")
Based on the two graphs presented, we are able to observe that values of the distance variable are more discrete than they were expected to be. Logically speaking, we should have a greater variety of values since it is extremely rare for two planes to cover the exact same distance going back and forth to the same destination. By going back to the source of our data, we were able to discover that distances provided by our dataset are approximations of the shortest distance between any two points on the surface of a sphere. Therefore, any two flights with the same start and end points have the same distances. This can be verified from the table below.
distance_table <- df %>% select(origin, dest, dest_state, distance) %>% group_by(origin, dest)
head(distance_table, 10)
## # A tibble: 10 x 4
## # Groups: origin, dest [10]
## origin dest dest_state distance
## <int> <int> <int> <int>
## 1 10140 14107 81 328
## 2 14107 10140 86 328
## 3 13930 14108 41 130
## 4 14108 13930 41 130
## 5 12217 13930 41 510
## 6 13930 12217 51 510
## 7 10397 13930 41 606
## 8 13930 10397 34 606
## 9 11823 13930 41 157
## 10 13930 11823 42 157
Destination Airport
df$dest %>% unique() %>% length() #How many airports are in the data?
## [1] 300
To begin exploring the data on destination, with 300 unique values, we first construct a dataframe grouped by destination and with features for number of flights, total departure delay and mean departure delay.
df_n <- df %>% group_by(dest) %>%
summarise(n = n(),
total_delay = sum(dep_delay),
avg_delay = total_delay / n) %>%
arrange(desc(n))
With the dataframe in hand, we will look at destinations with the most flights, more than 20 000 over the period.
df_n %>% filter(n > 20000) %>% ggplot() +
geom_point(aes(reorder(dest, avg_delay), avg_delay, size=n, fill=avg_delay), shape=21, alpha=.7) +
coord_flip() + scale_fill_viridis_c() +
xlab('destination') +
labs(caption = "Source: Final Project Dataset")
11618 Newark Liberty International stands out as having the highest average delay of the lot. 14771 San Francisco International has a smaller average delay than Newark, but it stands out among those airports with more than 40 000 flights. Apart from San Francisco these airports have average delays in the neightborhood of five minutes. It is also noticeable that none of the most frequent destinations have a negative average delay.
df_n %>% filter(avg_delay < 0) %>% ggplot() +
geom_point(aes(reorder(dest, avg_delay), avg_delay, size=n, fill=avg_delay), shape=21, alpha=.7) +
coord_flip() + scale_fill_viridis_c() +
xlab('destination') +
labs(caption = "Source: Final Project Dataset")
Here we looked at those destinations with negative average delays. According to our size scale, the frequency of flights to these airports is more than an order of magnitude lower than in the previous plot. Two airports stand out in terms of average delay, 10165 Adak Island, Alaska and 10754 Barrow, Alaska.
df_n %>% filter(avg_delay < 1 & avg_delay > -1) %>% ggplot() +
geom_point(aes(reorder(dest, avg_delay), avg_delay, size=n, fill=avg_delay), shape=21, alpha=.7) +
coord_flip() + scale_fill_viridis_c() +
xlab('destination') +
labs(caption = "Source: Final Project Dataset")
Among those airports with average delays within one minute of zero, the number of flights also tends to be small, no more than 3 000.
df_n %>% filter(avg_delay > 10) %>% ggplot() +
geom_point(aes(reorder(dest, avg_delay), avg_delay, size=n, fill=avg_delay), shape=21, alpha=.7) +
coord_flip() + scale_fill_viridis_c() +
xlab('destination') +
labs(caption = "Source: Final Project Dataset")
When we filter to those airports with average delays greater than ten minutes, airport 11618 Newark stand out as by far the most popular destination in this bracket. As with the negative delay visualization, there are a small number of outliers in terms of average delay:
- 13388 Mammoth Lakes, California
- 15295 Toledo, OH
- 13964 North Bend/Coos Bay, Oregon
- 10372 Aspen, Colorado
Mammoth Lakes and Aspen are both ski resort towns so it might be interesting to look at the average delays in the summer to see if there is a reversion to the mean during warmer months.
Carrier
Are certain airlines more likely to experience delays?
top_carriers <- df %>% group_by(carrier) %>% summarize(num_flights = n())
top_carriers <- top_carriers[order(-top_carriers$num_flights), ]
top_carriers <- head(top_carriers, 5)
last_carriers <- df %>% group_by(carrier) %>% summarize(num_flights = n())
last_carriers <- subset(last_carriers, last_carriers$num_flights > 100)
last_carriers <- last_carriers[order(last_carriers$num_flights), ]
last_carriers <- head(last_carriers, 5)
df$pop_carrier <- ifelse(df$carrier %in% top_carriers$carrier, "Busy", 0)
df$pop_carrier <- ifelse(df$carrier %in% last_carriers$carrier, "Not Busy", df$pop_carrier)
head(top_carriers)
## # A tibble: 5 x 2
## carrier num_flights
## <chr> <int>
## 1 WN 323499
## 2 DL 221833
## 3 AA 215292
## 4 OO 180196
## 5 UA 146044
The five most popular airlines are:
- WN: Southwest Airlines
- DL: Delta Airlines
- AA: American Airlines
- OO: Skywest Airlines
- UA: United Airlines
head(last_carriers)
## # A tibble: 5 x 2
## carrier num_flights
## <chr> <int>
## 1 VX 18077
## 2 HA 19826
## 3 F9 26981
## 4 NK 37966
## 5 AS 43910
The five least popular airlines are:
- VX: Virgin America
- HA: Hawaiian Airlines
- F9: Frontier Airlines
- NK: Spirit Airlines
- AS: Alaska Airlines
A reminder that this is only in regards to domestic flights departing New York.
ggplot(subset(df, df$pop_carrier != 0), aes(x=factor(carrier), y=log(dep_delay))) + geom_boxplot(varwidth=TRUE, fill = '#774184') +
ggtitle('Distribution of Positive Delay in the most and least frequent Airlines') +
facet_grid(. ~ pop_carrier) +
xlab("Airline") +
ylab("log(Delay in Minutes)") +
labs(caption = "Source: Final Project Dataset")
The departure delay does not seem to vary as much, implying that the frequency of an airline does not play a factor on departure delay. Upon closer inspection, Skywest is the airline with the highest median delay whereas Hawaiaan has the lowest. Southwest is the most popular out of the ten, with more than 3500 flights out of New York State every day. It is also the best performing airline out of the “Busy” group.
Judging from the plots presented so far, departure delays are more likely to be caused by factors like the time of departure and the day of the week rather than the popularity of the airline or the flight’s destination. However in the event of a delay prior to departure, is it a sign that the arrival delay is going to be even larger or will the pilot make-up for the delay by getting to the destination faster? Are long flights more likely to have a high delay make-up (difference between departure and arrival delay) or do they exacerbate the situation further?
Arrival Delay
How do departure delay and arrival delay relate? Is it possible that pilots make up for delays with a shorter flight, or is it a sign that the flight is going to take even longer?
options(scipen=1)
ggplot(df, aes(dep_delay, arr_delay)) +
geom_hex(bins = 50) +
scale_fill_gradientn(colours = viridis(5, direction = 1)) +
xlab('Departure Delay in Minutes') +
ylab('Arrival Delay in Minutes') +
ggtitle('Departure Delay in relation to Arrival Delay') +
guides(fill=guide_legend(title="Frequency")) +
labs(caption = "Source: Final Project Dataset")
The two variables appear to have a roughly linear relationship, especially at higher values. There are exceptions to this, where a large number of flights experience high arrival delay even though they had zero departure delay. This can be explained by flights that had trouble landing because destination airport landing strips were busy.
We construct a variable called diff_delay that measures the difference between departure delay and arrival delay. This will allow us to examine whether a departure delay is likely to be amended for.
df$diff_delay <- df$dep_delay - df$arr_delay
Airline
df$pos_makeup <- ifelse(df$diff_delay >= 0, "Positive", "Negative")
df$pos_makeup <- factor(df$pos_makeup)
ggplot(subset(df, df$pop_carrier != 0 & df$pos_makeup == 'Positive'), aes(x=factor(carrier), y=log(dep_delay))) + geom_boxplot(varwidth=TRUE, fill = '#774184') +
ggtitle('Distribution of Positive Makeup Delay in the busiest and least busy Airlines') +
facet_grid(. ~ pop_carrier) +
xlab("Airline") +
ylab("log(Positive Makeup Delay in Minutes)") +
labs(caption = "Source: Final Project Dataset")
Apparently, the size of an airline doesn’t have an impact on its ability to make up for a delay.
Destination State
We wish to explore any patterns relating departure delay and arrival delay. Does the distance between New York and the destination state influence a flight’s performance? What about the destination state’s population density?
# data frame for the cleveland plot and map
flight <- df
flight$dest_state <- factor(flight$dest_state)
region_code <- c("1", "2", "3", "4", "5", "11", "12", "13", "14", "15", "16", "21", "22", "23", "31", "33", "34", "35", "36", "37", "38", "39", "41", "42", "43", "44", "45", "51", "52", "53", "54", "61", "62", "63", "64", "65", "66", "67", "71", "72", "73", "74", "81", "82", "83", "84", "85", "86", "87", "88", "91", "92", "93", "31")
region_name <- c("Alaska", "Hawaii", "Puerto Rico", "U.S. Virgin Islands", "U.S. Pacific Trust Territories and Possessions", "Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania", "Delware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "West Virginia", "Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota", "Arkansas", "Louisiana", "Oklahoma", "Texas", "Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming", "California", "Oregon", "Washington", "Delaware")
flight$dest_state <- factor(flight$dest_state, region_code, region_name)
# tidy data for cleveland plot and map
flight_state <- flight %>%
group_by(dest_state) %>%
dplyr::summarize(sum_n = n(), sum_arr = sum(arr_delay), sum_dep = sum(dep_delay), sum_diff = sum(diff_delay)) %>%
mutate(mean_arr = sum_arr/sum_n, mean_dep = sum_dep/sum_n, mean_diff = sum_diff/sum_n)
# cleveland plot of arrival time and depature time
flight_state$dest_state <- reorder(flight_state$dest_state, flight_state$mean_arr)
ggplot(flight_state) +
geom_point(aes(mean_arr, dest_state, color = 'mean_arr'), size = 3) +
geom_point(aes(mean_dep, dest_state, color = 'mean_dep'), size = 3) +
labs(title = "Arrival and Departure Delay over States", x = "Average Delay in Minutes", y = "", color = '') +
theme(legend.position = 'bottom') +
scale_color_manual(name = '', values = c('mean_arr'= '#774184', 'mean_dep'='#4E8441'), labels=c("Arrival Delay", "Departure Delay"))
# generate arrival map format for choropleth
arr_map_form <- subset(flight_state, select = c('dest_state', 'mean_arr'))
names(arr_map_form)[1] <- paste('region')
names(arr_map_form)[2] <- paste('value')
arr_map_form$region <- as.character(arr_map_form$region)
arr_map_form$region <- tolower(arr_map_form$region)
# generate arrival map
arr_map <- StateChoropleth$new(arr_map_form)
arr_map$title = "Arrival Delay over States"
arr_map$show_labels = TRUE
arr_map$set_num_colors(1)
arr_map$ggplot_scale = scale_fill_viridis(name = 'Average Arrival Delay in Minutes')
arr_map$render()
Neither departure nor arrival delay are strongly associated with destination state. The five regions with highest arrival delays are New Jersey, Puerto Rico, the U.S Virgin Islands, Hawaii and New York. The five regions with lowest arrival delays are Virginia, Maryland, Utah, South Carolina and the U.S Pacific Trust Territories and Possessions.
We surprisingly found the average departure delay is higher than the average arrival delay in all states, no matter their location or population density.
# cleveland plot of time difference
flight_state$dest_state <- reorder(flight_state$dest_state, flight_state$mean_diff)
ggplot(flight_state) +
geom_point(aes(mean_diff, dest_state), color = '#774184', size = 3) +
labs(title = "Difference between Departure and Arrival Delay over States", x = "Average Difference in Minutes", y = "", color = '')
# generate diff map format for choropleth
diff_map_form <- subset(flight_state, select = c('dest_state', 'mean_diff'))
names(diff_map_form)[1] <- paste('region')
names(diff_map_form)[2] <- paste('value')
diff_map_form$region <- as.character(diff_map_form$region)
diff_map_form$region <- tolower(diff_map_form$region)
# generate diff map
diff_map <- StateChoropleth$new(diff_map_form)
diff_map$title = "Difference in Delay over States"
diff_map$show_labels = TRUE
diff_map$set_num_colors(1)
diff_map$ggplot_scale = scale_fill_viridis(name = 'Average Difference in Minutes')
diff_map$render()
To our surprise, flights to the Northwestern states (Washington, Oregon, Idaho, Montana, Wyoming, North Dakota and South Dakota) had a lower delay difference than the others. This contradicts our hypothesis that pilots can make up for time lost during longer flights. There doesn’t appear to be a relationship between destination and the difference between arrival and departure delay. In addition to the Northwest states, Alaska, Hawaii, U.S. Virgin Islands and Mississippi also have a low make-up delay.